This is preliminary code from an e-commerce project done for an online vendor. The dataset, final code and report can't be made public. - Robin
setwd("C:/Users/robin/Projects/e-commerce/example_data")
# Load dataset
data <- read.csv("ecommerce_dataset.csv", sep = ";") # Delimiter in the CSV file is a semi-colon
# Load packages
library(dplyr)
library(lubridate)
library(ggplot2)
library(agricolae)
We'll first see a glimpse/summary of the dataset.
glimpse(data) # To see the variables and their types
Observations: 400,637
Variables: 13
$ Timestamp            <int> 1460462764, 1490572501, 1490572501, 1490572501...
$ User.ID              <fct> NULL, 0c6205543f3f6a7e8082, 0c6205543f3f6a7e80...
$ Cross.device.user.ID <fct> NULL, NULL, NULL, NULL, NULL, NULL, NULL, c05f...
$ Existing.client      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ Event.type           <fct> Basket, Listing, Listing, Listing, Listing, Li...
$ Product.ID           <fct> Unknown, Unknown, Unknown, Unknown, Unknown, U...
$ Product.Price        <fct> 11, NULL, NULL, NULL, NULL, NULL, NULL, NULL, ...
$ Product.quantity     <fct> 1, NULL, NULL, NULL, NULL, NULL, NULL, 1, NULL...
$ Product.category     <fct> Medium Items, Packages, Packages, Packages, Pa...
$ Environment          <fct> web, web, web, web, web, web, web, web, web, w...
$ Device.type          <fct> Desktop, Desktop, Desktop, Desktop, iPhone, iP...
$ Browser.family       <fct> safari, edge, edge, edge, mobile safari, mobil...
$ User.location        <fct> Inland, Outside, Outside, Outside, Inland, Inl...
Variables 'User.location' and 'Browser.family' were not considered in detail in this analysis. Throughout, we define the sales-to-visits ratio as

$$\text{Sales to Visits} = \frac{\text{Number of Sales}}{\text{Number of visits to pages containing products of the same kind}}$$

We'll look for duplicates in the data and remove them.
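The ratio defined above can be wrapped in a small helper. A minimal sketch (not project code); the event labels match the Event.type levels used later in the notebook, and the toy input is purely illustrative:

```r
# Count sales against product-page visits and return the rounded ratio.
sales_to_visits <- function(events) {
  sales  <- sum(events == "Sales")
  visits <- sum(events %in% c("Product", "Listing"))
  round(sales / visits, 2)
}

# Toy vector of events, purely illustrative
sales_to_visits(c("Sales", "Product", "Product", "Listing", "Sales"))  # 0.67
```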
nrow(distinct(data)) # Get distinct instances/rows
nrow(data) - nrow(distinct(data)) # Get duplicate rows
data <- distinct(data)
Now we'll extract dates and times from the timestamps.
data$Date <- as.POSIXct(data$Timestamp, origin="1970-01-01") # Convert timestamps into date object
head(data$Date)
[1] "2016-04-12 14:06:04 CEST" "2017-03-27 01:55:01 CEST"
[3] "2017-03-27 01:38:06 CEST" "2017-03-27 01:22:51 CEST"
[5] "2017-03-27 01:00:49 CEST" "2017-03-27 01:00:29 CEST"
tail(data$Date)
[1] "2016-12-07 13:35:54 CET" "2016-12-07 13:39:41 CET"
[3] "2016-12-07 12:30:11 CET" "2017-02-08 10:32:22 CET"
[5] "2017-02-08 10:27:08 CET" "2017-01-25 10:21:47 CET"
We'll order the dataset by date.
data <- data[order(data$Date),]
head(data$Date); tail(data$Date) # See start and end dates of sorted data
[1] "2016-03-28 02:17:38 CEST" "2016-03-28 02:30:53 CEST"
[3] "2016-03-28 02:34:44 CEST" "2016-03-28 03:10:31 CEST"
[5] "2016-03-28 03:21:53 CEST" "2016-03-28 03:22:50 CEST"
[1] "2017-04-17 00:45:17 CEST" "2017-04-17 00:45:26 CEST"
[3] "2017-04-17 00:45:40 CEST" "2017-04-17 00:53:18 CEST"
[5] "2017-04-17 00:55:07 CEST" "2017-04-17 01:18:13 CEST"
Now our data is ordered by time. CEST represents Central European Summer Time.
paste('Start time:', min(data$Date)); paste('End time:', max(data$Date))
# Divide date into months, years, weekdays, hours etc.
data$Month <- month(data$Date, label=TRUE); head(data$Month)
data$Year <- year(data$Date); head(data$Date)
data$weekday <- wday(data$Date, label=TRUE); head(data$weekday)
data$time <- format(data$Date, "%H:%M:%S"); head(data$time)
data$hour <- hour(data$Date); head(data$hour)
data$YearMonth <- paste(data$Year, data$Month); head(data$YearMonth)
[1] "2016-03-28 02:17:38 CEST" "2016-03-28 02:30:53 CEST"
[3] "2016-03-28 02:34:44 CEST" "2016-03-28 03:10:31 CEST"
[5] "2016-03-28 03:21:53 CEST" "2016-03-28 03:22:50 CEST"
So we have data available from 28 March 2016 to 17 April 2017.
Now we'll convert the columns 'Product.Price' and 'Product.quantity' to numeric variables. Currently they are factors (categorical).
head(data$Product.Price)
data$Price <- as.numeric(levels(data$Product.Price)[data$Product.Price]); head(data$Price)
Warning message in eval(expr, envir, enclos): "NAs introduced by coercion"
data$Quantity <- as.numeric(levels(data$Product.quantity)[data$Product.quantity]); head(data$Quantity)
Warning message in eval(expr, envir, enclos): "NAs introduced by coercion"
NAs will be introduced by coercion in place of the "NULL" level of the original factor.
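A small demonstration of why the warning appears, using a toy factor (not the real column): the "NULL" level cannot be parsed as a number, so it becomes NA.

```r
f <- factor(c("11", "NULL", "7", "NULL"))
as.numeric(levels(f))[f]      # 11 NA 7 NA, with "NAs introduced by coercion"
as.numeric(as.character(f))   # equivalent result
```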
data$Revenue <- data$Price*data$Quantity; head(data$Revenue)
Now we'll get data specific to sales.
levels(data$Event.type) # To get levels of Events.
There are four event types: Basket, Listing, Product and Sales.
# Number of transactions
sdata <- data[data$Event.type=="Sales",]
paste("Total number of transactions is:", nrow(sdata))
# Proportion of sales data of total data
sales.prop <- nrow(sdata)*100/nrow(data); round(sales.prop,2)
Only 5.86% of the rows are sales events.
Getting total revenue
# Compute Revenue
revenue16.17 <- sum(sdata$Revenue); paste("Total revenue is:", revenue16.17)
This is the total revenue in the period 28 March 2016 to 17 April 2017.
Now, we'll try to see how revenue changed over time.
options(repr.plot.width=8, repr.plot.height=3)
sdata %>%
group_by(Date) %>%
summarise(r = sum(Revenue)) %>%
ggplot(aes(x = Date, y = r)) + geom_line() + geom_smooth(method = 'auto', se = FALSE) + labs(x = 'Date', y = 'Revenue (Euro)', title = 'Revenue by Date')
`geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
# Number of transactions by User's location
summary(sdata$User.location)
x <- sum(sdata$Revenue[sdata$User.location=="Inland"])
y <- sum(sdata$Revenue[sdata$User.location=="Outside"])
paste("Revenue generated from Inland Users is: ", x)
paste("Revenue generated from Outside Users is: ", y)
Sales to Visits ratio is comparable for Inland and Outside users.
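The per-location comparison can be made explicit. A hedged sketch computing the sales-to-visits ratio per user location on a toy data frame; the column names and factor levels follow the notebook, the counts do not:

```r
# Split by location and compute sales / product-page visits in each group.
ratio_by_location <- function(df) {
  sapply(split(df, df$User.location), function(d) {
    round(sum(d$Event.type == "Sales") / sum(d$Event.type == "Product"), 2)
  })
}

# Illustrative stand-in for the real `data`
toy <- data.frame(
  Event.type    = c("Sales", "Product", "Product", "Sales", "Product", "Product"),
  User.location = c("Inland", "Inland", "Inland", "Outside", "Outside", "Outside")
)
ratio_by_location(toy)  # Inland 0.5, Outside 0.5
```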
# Number of transactions by date
sdata$dummy[sdata$Event.type=="Sales"] <- 1 # Create a dummy variable having 1's indicating Sales
options(repr.plot.width=8, repr.plot.height=3)
sdata %>%
group_by(Date) %>%
summarise(t = sum(dummy)) %>%
ggplot(aes(x = Date, y = t)) + geom_line() + geom_smooth(method = 'auto', se = FALSE) + labs(x = 'Date', y = 'Number of transactions', title = 'Number of transactions by Date')
`geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
We'll get top 5 days for revenue and also transactions.
date.rev <- sdata %>%
group_by(Date) %>%
summarise(revenue = sum(Revenue)) %>%
arrange(desc(revenue))
date.tra <- sdata %>%
group_by(Date) %>%
summarise(transactions = sum(dummy)) %>%
arrange(desc(transactions))
topdate.rev <- top_n(date.rev, 5, revenue); topdate.rev
topdate.tra <- top_n(date.tra, 5, transactions); topdate.tra
| Date | revenue |
|---|---|
| 2017-04-14 14:43:44 | 427 |
| 2017-03-31 16:31:55 | 425 |
| 2017-01-21 04:20:05 | 320 |
| 2016-12-30 14:24:41 | 317 |
| 2017-02-18 12:45:25 | 293 |
| Date | transactions |
|---|---|
| 2016-12-10 21:19:01 | 7 |
| 2016-06-27 01:38:55 | 6 |
| 2016-07-04 00:26:55 | 6 |
| 2016-11-02 12:07:09 | 6 |
| 2017-01-19 22:15:32 | 6 |
| 2017-01-21 04:20:05 | 6 |
| 2017-01-30 17:36:42 | 6 |
The table above shows the sales instances with the most revenue on a single day.
Most of these dates fall in the period November to early March.
For November, December and January, the high revenue is clearly driven by the high number of transactions taking place during this period.
For 31 March and 14 April 2017, however, the high revenue is not explained by a higher number of transactions.
We'll see all Revenue data from these top 5 Revenue dates.
p = list()
for (i in 1:nrow(topdate.rev)){
p[[i]] = sdata[sdata$Date==topdate.rev$Date[i],]$Revenue
}
names(p) = paste("Revenue at ", topdate.rev$Date)
p
q = list()
for (i in 1:2){
q[[i]] = sdata[sdata$Date==topdate.rev$Date[i],]$User.ID
}
q
As it turns out, the spikes on those individual days in April and March were each due to a single expensive purchase, from user IDs d926128c491174a3f549 and 6a7410d13d7dc7549bdb.
We'll explore the User-ID-based data in detail later on.
Now we know the reason for the high revenue on those two days in April 2017 and March 2017.
sdata.new <- sdata[!sdata$YearMonth %in% c("2016 Mar", "2017 Apr"),]
month.rev <- sdata.new %>%
group_by(YearMonth) %>%
summarise(revenue = sum(Revenue)) %>%
arrange(desc(revenue))
month.rev$Revenue.pct = round(100*month.rev$revenue/sum(month.rev$revenue),2)
colnames(month.rev)[3] = 'Percentage'
month.rev
| YearMonth | revenue | Percentage |
|---|---|---|
| 2016 Dec | 88297 | 19.61 |
| 2016 Nov | 60391 | 13.41 |
| 2017 Feb | 47338 | 10.51 |
| 2017 Jan | 38067 | 8.45 |
| 2016 Sep | 37016 | 8.22 |
| 2016 Oct | 35831 | 7.96 |
| 2016 Apr | 31927 | 7.09 |
| 2017 Mar | 29050 | 6.45 |
| 2016 May | 24030 | 5.34 |
| 2016 Aug | 21807 | 4.84 |
| 2016 Jun | 19323 | 4.29 |
| 2016 Jul | 17169 | 3.81 |
month.rev$levels = factor(month.rev$YearMonth, levels= month.rev$YearMonth[c(7,9,11,12,10,5,6,2,1,4,3,8)])
ggplot(month.rev, aes(x = levels, y = revenue)) + geom_col() + labs(x = 'Month', y = 'Revenue (Euro)', title = 'Revenue by month (bar labels show % share)') + geom_text(aes(label= Percentage),position=position_stack(vjust = 0.5))
It is clear that most of the revenue (around 50%) was generated during Nov 2016 - Feb 2017.
Since the dataset is from retail, this behaviour is likely due to seasonality: revenue peaks during Nov-Dec thanks to the holidays and the accompanying increase in the number of sales, and declines thereafter. The revenue generated during the other months is much lower.
In order to see this more clearly, we get number of transactions and average order value by month.
sdata.new <- sdata[!sdata$YearMonth %in% c("2016 Mar", "2017 Apr"),]
month.rev <- sdata.new %>%
group_by(YearMonth) %>%
summarise(revenue = sum(Revenue), transactions = sum(dummy)) %>%
mutate(AOV = round(revenue/transactions,2)) %>%
arrange(desc(revenue))
month.rev$Percentage <- round(month.rev$transactions*100/sum(month.rev$transactions),2)
month.rev
summary = list()
for(i in 1:3){
summary[[i]] = c(round(mean(as.data.frame(month.rev)[,i+1]),2), round(sqrt(var(as.data.frame(month.rev)[,i+1])),2))}
names(summary) <- c("Avg & Stdev Revenue", "Avg & Stdev Transactions", "Avg & Stdev AOV"); summary
month.rev$levels = factor(month.rev$YearMonth, levels= month.rev$YearMonth[c(7,9,11,12,10,5,6,2,1,4,3,8)])
| YearMonth | revenue | transactions | AOV | Percentage |
|---|---|---|---|---|
| 2016 Dec | 88297 | 4065 | 21.72 | 18.68 |
| 2016 Nov | 60391 | 2563 | 23.56 | 11.78 |
| 2017 Feb | 47338 | 2201 | 21.51 | 10.12 |
| 2017 Jan | 38067 | 1827 | 20.84 | 8.40 |
| 2016 Sep | 37016 | 1984 | 18.66 | 9.12 |
| 2016 Oct | 35831 | 1668 | 21.48 | 7.67 |
| 2016 Apr | 31927 | 1570 | 20.34 | 7.22 |
| 2017 Mar | 29050 | 1347 | 21.57 | 6.19 |
| 2016 May | 24030 | 1306 | 18.40 | 6.00 |
| 2016 Aug | 21807 | 1101 | 19.81 | 5.06 |
| 2016 Jun | 19323 | 1025 | 18.85 | 4.71 |
| 2016 Jul | 17169 | 1099 | 15.62 | 5.05 |
ggplot(month.rev, aes(x = levels, y = AOV)) + geom_col() + labs(x = 'Month', y = 'Average Order Value (AOV)', title = 'Average Order Value')
ggplot(month.rev, aes(x = levels, y = transactions)) + geom_col() + labs(x = 'Month', y = 'Number of Transactions', title = 'Number of Transactions') + geom_text(aes(label=Percentage),position=position_stack(vjust = 0.5))
summary
The higher revenue during Nov-Dec is due to an increase in the number of sales, while the Average Order Value (AOV) stays roughly constant throughout the year.
The standard deviation of monthly transactions (853.13) is nearly half the monthly average, whereas the average order value barely varies.
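The contrast can be made explicit with a coefficient of variation (sd/mean); a minimal sketch using the monthly figures from the table above:

```r
# Relative spread: transactions vary far more (month to month) than AOV does.
cv <- function(x) sd(x) / mean(x)
transactions <- c(4065, 2563, 2201, 1827, 1984, 1668, 1570, 1347, 1306, 1101, 1025, 1099)
aov.month    <- c(21.72, 23.56, 21.51, 20.84, 18.66, 21.48, 20.34, 21.57, 18.40, 19.81, 18.85, 15.62)
round(c(transactions = cv(transactions), AOV = cv(aov.month)), 2)
```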
monthlysummary = sdata %>%
group_by(date(Date), Month) %>%
summarise(Tran = n_distinct(Timestamp)) %>%
ungroup()
kruskal.test(Tran ~ Month, data = monthlysummary)
	Kruskal-Wallis rank sum test

data:  Tran by Month
Kruskal-Wallis chi-squared = 162.12, df = 11, p-value < 2.2e-16
kruskal(monthlysummary$Tran, monthlysummary$Month, console = TRUE)
Study: monthlysummary$Tran ~ monthlysummary$Month
Kruskal-Wallis test's
Ties or no Ties
Critical Value: 162.1231
Degrees of freedom: 11
Pvalue Chisq : 0
monthlysummary$Month, means of the ranks
monthlysummary.Tran r
Apr 164.36957 46
Aug 109.22581 31
Dec 298.95161 31
Feb 288.01786 28
Jan 241.10000 30
Jul 90.85484 31
Jun 104.35000 30
Mar 150.01429 35
May 142.09677 31
Nov 285.11667 30
Oct 219.98387 31
Sep 247.75000 30
Post Hoc Analysis
t-Student: 1.966362
Alpha : 0.05
Groups according to probability of treatment differences and alpha level.
Treatments with the same letter are not significantly different.
monthlysummary$Tran groups
Dec 298.95161 a
Feb 288.01786 ab
Nov 285.11667 ab
Sep 247.75000 bc
Jan 241.10000 c
Oct 219.98387 c
Apr 164.36957 d
Mar 150.01429 de
May 142.09677 def
Aug 109.22581 efg
Jun 104.35000 fg
Jul 90.85484 g
As suspected, the difference in the number of transactions across months is statistically significant.
To find which pairs of months differ significantly, we ran a post-hoc test with the following results -
| Month | Letter |
|---|---|
| Jan | c |
| Feb | ab |
| Mar | de |
| Apr | d |
| May | def |
| Jun | fg |
| Jul | g |
| Aug | efg |
| Sep | bc |
| Oct | c |
| Nov | ab |
| Dec | a |
Groups with the same letter are not significantly different.
Now we'll look into the sales-to-visits ratio. For this purpose we'll consider only visits to product-specific pages.
# Get product visits
pdata <- data[data$Event.type=="Product",]
paste("No. of visits to product pages is:", nrow(pdata))
# Get product visits by month
# Remove April 2017 and March 2016
# Create a dummy variable having 1's
pdata$dummy <- rep(0,nrow(pdata))
pdata$dummy[pdata$Event.type=="Product"] <- 1
pdata.new <- pdata[!pdata$YearMonth %in% c("2016 Mar", "2017 Apr"),]
month.prod <- pdata.new %>%
group_by(YearMonth) %>%
summarise(Visits = sum(dummy)) %>%
ungroup()
month.prod$sales <- rep(0,12)
month.prod$sales.to.visits.ratio <- rep(0,12)
for(i in 1:12){
month.prod$sales[i] <- month.rev$transactions[which(month.rev$YearMonth == month.prod$YearMonth[i])]
}
for(i in 1:12){
month.prod$sales.to.visits.ratio[i] <- round(month.prod$sales[i]/month.prod$Visits[i], 2)
}
month.prod[order(-month.prod$sales.to.visits.ratio),]
| YearMonth | Visits | sales | sales.to.visits.ratio |
|---|---|---|---|
| 2016 Dec | 41799 | 4065 | 0.10 |
| 2016 Apr | 16747 | 1570 | 0.09 |
| 2016 Nov | 27295 | 2563 | 0.09 |
| 2017 Feb | 24607 | 2201 | 0.09 |
| 2017 Jan | 23306 | 1827 | 0.08 |
| 2017 Mar | 16861 | 1347 | 0.08 |
| 2016 Jul | 15979 | 1099 | 0.07 |
| 2016 Oct | 23700 | 1668 | 0.07 |
| 2016 Sep | 27979 | 1984 | 0.07 |
| 2016 May | 21312 | 1306 | 0.06 |
| 2016 Jun | 18673 | 1025 | 0.05 |
| 2016 Aug | 25469 | 1101 | 0.04 |
levels(factor(data$Device.type))
levels(data$Environment)
Let us see which environment was used in which device.
t(table(data$Environment, data$Device.type))
app_android web
Android - Smartphone 0 12213
Android - Tablet 0 5138
Desktop 0 286511
iPad 0 12423
iPhone 0 22678
Mobile - Other 0 4597
Unknown 41281 0
Evidently, the "Unknown" device category corresponds entirely to the app_android environment.
Knowing this is necessary to evaluate the difference in performance between mobile web browsing and the Android application.
First we'll look at device usage over the year. For this purpose, we'll group the devices into three categories: Desktop, Mobile and Tablet.
data$Device <- rep(0, nrow(data))
data$Dummy <- data$Dummy1 <- data$Dummy2 <- rep(0, nrow(data)) # Initialise the event indicators before flagging
data$Device.type = as.character(data$Device.type)
data$Device[data$Device.type %in% c("Android - Smartphone", "Android - Tablet", "iPhone" , "Mobile - Other", "Unknown")] = "Mobile"
data$Device[data$Device.type == "Desktop"] = "Desktop"
data$Device[data$Device.type %in% c("iPad", "Android - Tablet")] = "Tablet"
data$Dummy[data$Event.type == "Sales"] = 1
data$Dummy1[data$Event.type == "Product"] = 1
data$Dummy2[data$Event.type == "Basket"] = 1
data$Device <- factor(data$Device)
levels(data$Device)
Revenuebydevices <- data[data$Dummy==1,] %>%
group_by(Device) %>%
summarise(Revenue = sum(Revenue, na.rm=TRUE), transactions = sum(Dummy)) %>%
mutate(AOV = round(Revenue/transactions,2), Revenue.pct = round(Revenue / sum(Revenue) * 100, 2)) %>%
arrange(desc(Revenue))
Visitsbydevices <- data[data$Dummy1==1,] %>%
group_by(Device) %>%
summarise(Visits = sum(Dummy1))
Basketsbydevices <- data[data$Dummy2==1,] %>%
group_by(Device) %>%
summarise(Baskets = sum(Dummy2), Basketvalue = sum(Revenue)) %>%
mutate(ABV = round(Basketvalue/Baskets,2))
Revenuebydevices$sales.to.visits.ratio <- round(Revenuebydevices$transactions/Visitsbydevices$Visits,2)
Revenuebydevices$ABV <- Basketsbydevices$ABV
Revenuebydevices$Product.visits <- Visitsbydevices$Visits
Warning message: "Factor `Device` contains implicit NA, consider using `forcats::fct_explicit_na`" (repeated three times)
Revenuebydevices <- Revenuebydevices[complete.cases(Revenuebydevices),]; Revenuebydevices
ggplot(Revenuebydevices, aes(x="", y=Revenue/1000, fill=Device))+ ylab("Revenue (in 1000's)") + geom_bar(width = 1, stat = "identity") + coord_polar("y", start=0) + scale_fill_brewer(palette="Blues")+ theme_minimal()
| Device | Revenue | transactions | AOV | Revenue.pct | sales.to.visits.ratio | ABV | Product.visits |
|---|---|---|---|---|---|---|---|
| Desktop | 359827 | 16314 | 22.06 | 76.92 | 0.08 | 22.72 | 209837 |
| Mobile | 102373 | 5900 | 17.35 | 21.88 | 0.09 | 17.93 | 67624 |
| Tablet | 5602 | 324 | 17.29 | 1.20 | 0.02 | 19.46 | 15774 |
Around 77% of the revenue was generated on Desktop, while close to 22% came from Mobile.
The average basket value was broadly consistent with the average order value.
Looking at the sales-to-visits ratio (visits to product-specific pages), the ratio for tablets is much lower: only about 2% of tablet product-page visits lead to a sale.
We'll check whether there is any statistically significant difference between devices in generated revenue and in the sales-to-visits ratio.
Since this is count data and the categories Products and Transactions may not be independent, we used Fisher's exact test instead of the chi-square test.
M <- as.matrix(cbind(Revenuebydevices$transactions, as.factor(Revenuebydevices$Product.visits)))
fisher.test(M, y = Revenuebydevices$Device)
	Fisher's Exact Test for Count Data

data:  M
p-value = 0.05802
alternative hypothesis: two.sided
kruskal.test(Revenue ~ as.factor(Device), data = data[data$Event.type=="Sales",])
# Excluding tablets to compare Desktop and Mobile
kruskal(data[data$Event.type=="Sales",]$Revenue, as.factor(data[data$Event.type=="Sales",]$Device), console = TRUE)
	Kruskal-Wallis rank sum test

data:  Revenue by as.factor(Device)
Kruskal-Wallis chi-squared = 481.17, df = 2, p-value < 2.2e-16
Study: data[data$Event.type == "Sales", ]$Revenue ~ as.factor(data[data$Event.type == "Sales", ]$Device)
Kruskal-Wallis test's
Ties or no Ties
Critical Value: 481.1672
Degrees of freedom: 2
Pvalue Chisq : 0
as.factor(data[data$Event.type == "Sales", ]$Device), means of the ranks
data.data.Event.type.....Sales.....Revenue r
Desktop 11855.49 16314
Mobile 9712.63 5900
Tablet 10114.46 324
Post Hoc Analysis
t-Student: 1.960069
Alpha : 0.05
Groups according to probability of treatment differences and alpha level.
Treatments with the same letter are not significantly different.
data[data$Event.type == "Sales", ]$Revenue groups
Desktop 11855.49 a
Tablet 10114.46 b
Mobile 9712.63 b
When it comes to user acquisition, both Desktop and Mobile perform equally well.
The difference in revenue generated between Mobile and Desktop is statistically significant. Since the sales-to-product-visits ratio does not differ significantly among devices, more revenue can be generated simply by increasing the number of visits to the retailer's website.
Now we'll look at the distribution of Revenue generated, Number of transactions and Average order value throughout the year for individual device type.
Mobile <- data[data$Device.type %in% c("Android - Smartphone", "Android - Tablet", "iPhone" , "Mobile - Other", "Unknown"),]
data.new <- Mobile[!Mobile$YearMonth %in% c("2016 Mar", "2017 Apr"),]
dataforplot <- data.new[data.new$Dummy==1,] %>%
group_by(YearMonth) %>%
summarise(Revenue = sum(Revenue, na.rm=TRUE), Tran = sum(Dummy, na.rm =TRUE)) %>%
mutate(AOV = round(Revenue/Tran,2))
dataforplot <- dataforplot[complete.cases(dataforplot),]
dataforplot$Percentage <- round(dataforplot$Revenue*100/sum(dataforplot$Revenue),2)
dataforplot$Percentage1 <- round(dataforplot$Tran*100/sum(dataforplot$Tran),2)
dataforplot$Percentage2 <- round(dataforplot$AOV*100/sum(dataforplot$AOV),2)
dataforplot$levels = factor(dataforplot$YearMonth, levels= dataforplot$YearMonth[c(1, 6,5,4,2,9,8,7,3,11,10,12)])
temp = list(); temp[[1]] <- c(round(mean(dataforplot$Tran),2), round(sqrt(var(dataforplot$Tran)),2)); names(temp) <- "Average and Standard deviation Transactions"; temp
temp = list(); temp[[1]] <- c(round(mean(dataforplot$Revenue),2), round(sqrt(var(dataforplot$Revenue)),2)); names(temp) <- "Average and Standard deviation Revenue"; temp
temp = list(); temp[[1]] <- c(round(mean(dataforplot$AOV),2), round(sqrt(var(dataforplot$AOV)),2)); names(temp) <- "Average and Standard deviation AOV"; temp
The average monthly order value on mobile (16.55 euros, including the app) was lower than the overall average order value (around 20 euros).
The variation in the mobile average order value is comparable to the variation we saw in the overall average order value.
ggplot(dataforplot, aes(x = levels, y = Revenue)) + geom_col() + labs(x = 'Month', y = 'Revenue (Euro)', title = 'Revenue (Mobile), bar labels show % share') + geom_text(aes(label=Percentage),position=position_stack(vjust = 0.5))
ggplot(dataforplot, aes(x = levels, y = Tran)) + geom_col() + labs(x = 'Month', y = 'Number of Transactions', title = 'Number of Transactions (Mobile), bar labels show % share') + geom_text(aes(label=Percentage1),position=position_stack(vjust = 0.5))
ggplot(dataforplot, aes(x = levels, y = AOV)) + geom_col() + labs(x = 'Month', y = 'Average Order Value', title = 'Average Order Value (Mobile)')
Desktop <- data[data$Device.type == "Desktop",]
data.new <- Desktop[!Desktop$YearMonth %in% c("2016 Mar", "2017 Apr"),]
dataforplot <- data.new[data.new$Dummy==1,] %>%
group_by(YearMonth) %>%
summarise(Revenue = sum(Revenue, na.rm=TRUE), Tran = sum(Dummy)) %>%
mutate(AOV = round(Revenue/Tran,2))
dataforplot <- dataforplot[complete.cases(dataforplot),]
dataforplot$levels = factor(dataforplot$YearMonth, levels= dataforplot$YearMonth[c(1, 6,5,4,2,9,8,7,3,11,10,12)])
temp = list(); temp[[1]] <- c(round(mean(dataforplot$Tran),2), round(sqrt(var(dataforplot$Tran)),2)); names(temp) <- "Average and Standard deviation Desktop Transactions"; temp
temp = list(); temp[[1]] <- c(round(mean(dataforplot$Revenue),2), round(sqrt(var(dataforplot$Revenue)),2)); names(temp) <- "Average and Standard deviation Desktop Revenue"; temp
temp = list(); temp[[1]] <- c(round(mean(dataforplot$AOV),2), round(sqrt(var(dataforplot$AOV)),2)); names(temp) <- "Average and Standard deviation Desktop AOV"; temp
The average order value for Desktop has been consistently higher than for Mobile.
ggplot(dataforplot, aes(x = levels, y = Revenue)) + geom_col() + labs(x = 'Month', y = 'Total Revenue', title = 'Revenue (Desktop)')
ggplot(dataforplot, aes(x = levels, y = Tran)) + geom_col() + labs(x = 'Month', y = 'Number of Transactions', title = 'Number of Transactions (Desktop)')
ggplot(dataforplot, aes(x = levels, y = AOV)) + geom_col() + labs(x = 'Month', y = 'Average Order Value', title = 'Average Order Value (Desktop)')
Mobile.excl.unknown <- data[data$Device.type %in% c("Android - Smartphone", "Android - Tablet", "iPhone" , "Mobile - Other"),]
data.new <- Mobile.excl.unknown[!Mobile.excl.unknown$YearMonth %in% c("2016 Mar", "2017 Apr"),]
dataforplot <- data.new[data.new$Dummy==1,] %>%
group_by(YearMonth) %>%
summarise(Revenue = sum(Revenue, na.rm=TRUE), Tran = sum(Dummy, na.rm=TRUE)) %>%
mutate(AOV = round(Revenue/Tran,2))
dataforplot <- dataforplot[complete.cases(dataforplot),]
dataforplot$levels = factor(dataforplot$YearMonth, levels= dataforplot$YearMonth[c(1, 6,5,4,2,9,8,7,3,11,10,12)])
temp = list(); temp[[1]] <- c(round(mean(dataforplot$Tran),2), round(sqrt(var(dataforplot$Tran)),2)); names(temp) <- "Average and Standard deviation Transactions"; temp
temp = list(); temp[[1]] <- c(round(mean(dataforplot$Revenue),2), round(sqrt(var(dataforplot$Revenue)),2)); names(temp) <- "Average and Standard deviation Revenue"; temp
ggplot(dataforplot, aes(x = levels, y = Revenue)) + geom_col() + labs(x = 'Month', y = 'Total Revenue', title = 'Revenue (Mobile Web Browsing)')
ggplot(dataforplot, aes(x = levels, y = Tran)) + geom_col() + labs(x = 'Month', y = 'Number of Transactions', title = 'Number of Transactions (Mobile Web Browsing)')
ggplot(dataforplot, aes(x = levels, y = AOV)) + geom_col() + labs(x = 'Month', y = 'Average Order Value', title = 'Average Order Value (Mobile Web Browsing)')
We see that the total revenue generated from mobile web browsing fluctuated considerably, with a standard deviation of around 340 euros.
We also saw that the number of transactions on mobile web decreased while it increased for desktop.
Therefore it is necessary to look at the data generated through the user agent app_android (device type "Unknown").
unknown <- data[data$Device.type == "Unknown",]
min(unknown$Date)
max(unknown$Date)
[1] "2016-09-01 09:39:07 CEST"
[1] "2017-04-17 00:55:07 CEST"
data.new <- unknown[!unknown$YearMonth %in% c("2016 Mar", "2017 Apr"),]
dataforplot <- data.new[data.new$Dummy==1,] %>%
group_by(YearMonth) %>%
summarise(Revenue = sum(Revenue, na.rm=TRUE), Tran = sum(Dummy, na.rm = TRUE)) %>%
mutate(AOV = round(Revenue/Tran,2))
dataforplot <- dataforplot[complete.cases(dataforplot),]
temp = list(); temp[[1]] <- c(round(mean(dataforplot$Tran),2), round(sqrt(var(dataforplot$Tran)),2)); names(temp) <- "Average and Standard deviation Transactions through Android Application"; temp
temp = list(); temp[[1]] <- c(round(mean(dataforplot$Revenue),2), round(sqrt(var(dataforplot$Revenue)),2)); names(temp) <- "Average and Standard deviation Revenue through Android Application"; temp
dataforplot$levels = factor(dataforplot$YearMonth, levels= dataforplot$YearMonth[c(4,3,2,1,6,5,7)])
ggplot(dataforplot, aes(x = levels, y = Revenue)) + geom_col() + labs(x = 'Month', y = 'Total Revenue', title = 'Revenue (Android Applications)')
ggplot(dataforplot, aes(x = levels, y = AOV)) + geom_col() + labs(x = 'Month', y = 'Average Order Value', title = 'Average Order Value (Android Applications)')
ggplot(dataforplot, aes(x = levels, y = Tran)) + geom_col() + labs(x = 'Month', y = 'Number of Transactions', title = 'Number of Transactions (Android Applications)')
The same goes for transactions.
The average order value does not vary much.
Before the launch of the application, the number of transactions through mobile phones was below 200 per month.
After the launch, transactions through phones average around 1000 per month. So it is fair to say that the application performs better than mobile web browsing.
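The before/after comparison can be sketched as follows; the monthly counts below are illustrative stand-ins consistent with the magnitudes above (~200 pre-launch, ~1000 post-launch), not the real figures:

```r
# Compare average monthly mobile transactions before and after the app launch.
pre  <- c(150, 180, 120, 190, 160)   # illustrative monthly counts, pre-launch
post <- c(950, 1100, 1020, 980)      # illustrative monthly counts, post-launch
round(mean(post) / mean(pre), 1)          # roughly a 6x uplift
wilcox.test(pre, post)$p.value < 0.05     # nonparametric check, as used elsewhere in the notebook
```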
listing <- data[data$Event.type == "Listing",]
paste("first listing date is: ", min(listing$Date), "and its range is: ", round((max(listing$Date)-min(listing$Date)),2))
listing.mobile <- listing[!listing$Device.type %in% c("Desktop", "Tablet"),] # Mobile
range = max(listing.mobile$Date)-min(listing.mobile$Date)
paste("first listing event type on mobile: ", min(listing.mobile$Date), "and its range is: ", round(range,2))
# For desktop
listing.desktop <- listing[listing$Device.type == "Desktop",] # Desktop
paste("first listing date on Desktop is: ", min(listing.desktop$Date), "and its range is: ", round(max(listing.desktop$Date)-min(listing.desktop$Date),2))
Before 22 March 2017, there was no page dedicated to products belonging to the same category.
We'll check whether there has been any change in the sales-to-visits ratio after the (supposed) introduction of pages listing products of a similar kind.
listing1 <- data[data$Event.type %in% c("Listing", "Sales", "Product"), ]
listing1 <- listing1[listing1$Date >= min(listing.mobile$Date),]
listing1 <- listing1[listing1$Date <= max(listing.mobile$Date),]
sales <- nrow(listing1[listing1$Event.type=="Sales",])
visits <- nrow(listing1[listing1$Event.type %in% c("Listing", "Product"),])
svisits.listing <- sales/visits
round(svisits.listing, 2)
The sales to visits ratio for the period is 0.08 which is better than the previous year.
basket <- data[data$Event.type == "Basket",]
paste("first basket date is: ", min(basket$Date), "and its range is: ", round((max(basket$Date)-min(basket$Date)),2))
# For mobile
basket.mobile <- basket[!basket$Device.type %in% c("Desktop", "Tablet"),] # Mobile
range = max(basket.mobile$Date)-min(basket.mobile$Date)
paste("first basket event on mobile: ", min(basket.mobile$Date), "and its range is: ", round(range,2))
# For desktop
basket.desktop <- basket[basket$Device.type == "Desktop",] # Desktop
paste("first basket date on Desktop is: ", min(basket.desktop$Date), "and its range is: ", round(max(basket.desktop$Date)-min(basket.desktop$Date),2))
In order to better target our marketing to customers, we need a way to identify individual customers across sessions; we'll use Cross.device.user.ID for this.
userdata <- sdata[sdata$Cross.device.user.ID!="NULL",]
paste("Number of User IDs available is: ", nrow(userdata))
We'll create a new category, "Lcustomers" (short for loyal customers).
This category will consist of those customers who have shopped with this retailer more than twice.
We shall also create two other categories -
First, we'll look at how many distinct customers there are, and how many and which items they have purchased. We'll also consider the total amount they spent and the days since their last purchase.
# Initialise the indicator columns to 0 so the group sums don't propagate NAs
userdata$package <- userdata$medium <- userdata$small <- userdata$large <- userdata$dummy <- 0
userdata$package[userdata$Product.category=="Packages"] <- 1
userdata$medium[userdata$Product.category=="Medium Items"] <- 1
userdata$small[userdata$Product.category=="Small Items"] <- 1
userdata$large[userdata$Product.category=="Large Items"] <- 1
userdata$dummy[userdata$Event.type=="Sales"] <- 1
userdata$date <- as.Date(userdata$Date)
users <- userdata %>%
group_by(Cross.device.user.ID) %>%
summarise(package = sum(package), medium = sum(medium), small=sum(small), large=sum(large), Tran = sum(dummy), Revenue = sum(Revenue), quantity = sum(Quantity, na.rm=TRUE), firstdate = min(date), lastdate = max(date)) %>%
mutate(time.first = as.numeric(max(as.Date(sdata$Date)) - firstdate), time.last = as.numeric(max(as.Date(sdata$Date)) - lastdate))
users[is.na(users)] = 0
head(users)
paste("Number of users with 1 transaction is: ", nrow(users[users$Tran==1,]))
paste("Number of users with more than 1 transactions is: ", nrow(users[users$Tran>1,]))
paste("Number of users with more than 2 transactions is: ", nrow(users[users$Tran>2,]))
paste("Maximum number of transaction for one user is: ", max(users$Tran))
paste("Avg number of transactions per user is: ", round(mean(users$Tran),2))
paste("Maximum number of product quantities bought by one user is: ", max(users$quantity))
paste("Avg product quantities bought is: ", round(mean(users$quantity),2))
paste("For users, correlation between number of transactions and revenue is: ",round(cor(users$Revenue, users$Tran),2))
paste("Maximum difference between first and last date of purchase is: ", max(users$time.first-users$time.last))
vector <- c(sum(users$package), sum(users$medium), sum(users$small), sum(users$large))
names <- c("package", "medium", "small", "large")
paste("Product type bought most is: ", names[which(vector==max(vector))])
| Cross.device.user.ID | package | medium | small | large | Tran | Revenue | quantity | firstdate | lastdate | time.first | time.last |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 00146fb6a0c24b74232d | 1 | 0 | 0 | 0 | 1 | 15 | 1 | 2017-03-25 | 2017-03-25 | 22 | 22 |
| 003d60169becee4e390d | 0 | 1 | 0 | 0 | 1 | 17 | 1 | 2017-03-03 | 2017-03-03 | 44 | 44 |
| 0086c468fd6bc0d2ff61 | 0 | 0 | 1 | 0 | 1 | 10 | 1 | 2017-03-15 | 2017-03-15 | 32 | 32 |
| 00b1503d497a961b919e | 0 | 0 | 1 | 0 | 1 | 28 | 1 | 2017-03-08 | 2017-03-08 | 39 | 39 |
| 00de87f3528f3597af00 | 0 | 0 | 1 | 0 | 1 | 23 | 1 | 2017-03-22 | 2017-03-22 | 25 | 25 |
| 00eeb91a2c9ff36766a4 | 1 | 0 | 0 | 0 | 1 | 57 | 1 | 2017-04-09 | 2017-04-09 | 7 | 7 |
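The per-user aggregation above can be illustrated on a small synthetic purchase log (toy data invented for illustration; column names mirror the real, non-public dataset):

```r
library(dplyr)

# Toy purchase log with the columns the aggregation relies on
toy <- data.frame(
  Cross.device.user.ID = c("u1", "u1", "u2"),
  package  = c(1, NA, NA),            # category flags as built above
  small    = c(NA, NA, 1),
  dummy    = c(1, 1, 1),              # 1 marks a completed sale
  Revenue  = c(15, 20, 10),
  Quantity = c(1, 2, 1),
  date     = as.Date(c("2017-03-01", "2017-03-10", "2017-03-05"))
)

toy_users <- toy %>%
  group_by(Cross.device.user.ID) %>%
  summarise(package  = sum(package, na.rm = TRUE),
            small    = sum(small, na.rm = TRUE),
            Tran     = sum(dummy),
            Revenue  = sum(Revenue),
            quantity = sum(Quantity),
            lastdate = max(date)) %>%
  mutate(time.last = as.numeric(max(toy$date) - lastdate))
toy_users
# u1: 2 transactions, 35 revenue, 0 days since last purchase
# u2: 1 transaction, 10 revenue, 5 days since last purchase
```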
Lcustomers <- users[users$Tran > 2,]   # frequent buyers: more than 2 transactions
head(Lcustomers); nrow(Lcustomers)
vector <- c(sum(Lcustomers$package), sum(Lcustomers$medium), sum(Lcustomers$small), sum(Lcustomers$large))
names <- c("package", "medium", "small", "large")
paste("Product type bought most is: ", names[which(vector==max(vector))])
| Cross.device.user.ID | package | medium | small | large | Tran | Revenue | quantity | firstdate | lastdate | time.first | time.last |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 02a7b34386ba592c1ef6 | 0 | 0 | 3 | 0 | 3 | 21 | 3 | 2017-04-01 | 2017-04-05 | 15 | 11 |
| 0373ee30ba2f1f1ca8e4 | 0 | 0 | 0 | 0 | 3 | 29 | 3 | 2017-03-14 | 2017-03-16 | 33 | 31 |
| 039df8d15005b9008876 | 0 | 0 | 0 | 0 | 4 | 93 | 4 | 2017-02-25 | 2017-02-25 | 50 | 50 |
| 03a6bc25103535721051 | 0 | 0 | 0 | 0 | 4 | 67 | 5 | 2017-02-22 | 2017-04-09 | 53 | 7 |
| 0541abca3ea8a78ce2d8 | 0 | 0 | 0 | 0 | 3 | 144 | 3 | 2017-04-04 | 2017-04-09 | 12 | 7 |
| 057bb52d2a6c777bfad3 | 0 | 0 | 0 | 0 | 3 | 50 | 3 | 2017-02-14 | 2017-02-28 | 61 | 47 |
Scustomers <- users[users$time.last > 60,]   # lapsed buyers: no purchase in the last 60 days
head(Scustomers); nrow(Scustomers)
vector <- c(sum(Scustomers$package), sum(Scustomers$medium), sum(Scustomers$small), sum(Scustomers$large))
names <- c("package", "medium", "small", "large")
paste("Product type bought most is: ", names[which(vector==max(vector))])
| Cross.device.user.ID | package | medium | small | large | Tran | Revenue | quantity | firstdate | lastdate | time.first | time.last |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 03192220e87f8e23b5e5 | 0 | 0 | 1 | 0 | 1 | 9 | 1 | 2017-02-14 | 2017-02-14 | 61 | 61 |
| 043aff833259514934ab | 0 | 0 | 0 | 1 | 1 | 18 | 1 | 2017-02-13 | 2017-02-13 | 62 | 62 |
| 131350b39be3338fea43 | 0 | 1 | 0 | 0 | 1 | 9 | 1 | 2017-02-13 | 2017-02-13 | 62 | 62 |
| 13d56f9d9e836a10ae7f | 0 | 0 | 1 | 0 | 1 | 7 | 1 | 2017-02-14 | 2017-02-14 | 61 | 61 |
| 14397d06bcab24fd7d8a | 0 | 0 | 1 | 0 | 1 | 6 | 1 | 2017-02-13 | 2017-02-13 | 62 | 62 |
| 1ec74f90e0bd3b97414d | 1 | 0 | 0 | 0 | 1 | 21 | 1 | 2017-02-13 | 2017-02-13 | 62 | 62 |
AOVquantile <- quantile(users$Revenue/users$Tran)   # quartiles of average order value
AOVquantile
Icustomers <- users[users$Revenue/users$Tran > AOVquantile[4],]   # AOV above the 75th percentile
head(Icustomers); nrow(Icustomers)
vector <- c(sum(Icustomers$package), sum(Icustomers$medium), sum(Icustomers$small), sum(Icustomers$large))
names <- c("package", "medium", "small", "large")
paste("Product type bought most is: ", names[which(vector==max(vector))])
| Cross.device.user.ID | package | medium | small | large | Tran | Revenue | quantity | firstdate | lastdate | time.first | time.last |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 00b1503d497a961b919e | 0 | 0 | 1 | 0 | 1 | 28 | 1 | 2017-03-08 | 2017-03-08 | 39 | 39 |
| 00eeb91a2c9ff36766a4 | 1 | 0 | 0 | 0 | 1 | 57 | 1 | 2017-04-09 | 2017-04-09 | 7 | 7 |
| 013c56f4157b7cd6d644 | 0 | 1 | 0 | 0 | 1 | 37 | 1 | 2017-03-08 | 2017-03-08 | 39 | 39 |
| 0171a55cdca297aca753 | 0 | 0 | 0 | 0 | 2 | 69 | 2 | 2017-04-03 | 2017-04-03 | 13 | 13 |
| 017cbe8dcd707a1fe1dc | 1 | 0 | 0 | 0 | 1 | 63 | 1 | 2017-02-19 | 2017-02-19 | 56 | 56 |
| 03b4240419b1c57d3180 | 0 | 0 | 1 | 0 | 1 | 26 | 1 | 2017-02-26 | 2017-02-26 | 49 | 49 |
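The `AOVquantile[4]` index above works because `quantile()` with default arguments returns five values at the 0%, 25%, 50%, 75% and 100% cut points, so position 4 is the 75th percentile. A quick toy check:

```r
# quantile() defaults to the five quartile cut points
q <- quantile(1:100)
names(q)   # "0%" "25%" "50%" "75%" "100%"
q[4]       # position 4 is the 75th percentile
```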
These three customer segments (frequent, lapsed, and high-value) can be used to target marketing towards customers more effectively.
set.seed(123)
# Compute and plot the total within-cluster sum of squares for k = 1 to 15
k.max <- 15
udata <- users[,-c(1,9,10)]   # drop the user ID and raw date columns
wss <- sapply(1:k.max,
              function(k){kmeans(udata, k, nstart=50, iter.max=15)$tot.withinss})
wss
plot(1:k.max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
Beyond k=3, adding clusters reduces the total within-cluster sum of squares only marginally, so we will run k-means clustering with k=3.
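One caveat (an addition, not part of the original analysis): `Revenue` is on a much larger scale than the category flags, so it dominates the Euclidean distances `kmeans` minimizes. A minimal sketch of a robustness check with standardized features, using simulated stand-in data since the real `udata` is not public:

```r
# Stand-in for `udata` (the real user features can't be shared)
set.seed(123)
udata.sim <- data.frame(Tran     = rpois(100, 1) + 1,
                        Revenue  = round(rexp(100, 1/25), 2),
                        quantity = rpois(100, 1) + 1)

# scale() centers each column to mean 0 and sd 1, so no single
# variable dominates the within-cluster sum of squares
udata.scaled <- scale(udata.sim)
kmean.scaled <- kmeans(udata.scaled, 3, nstart = 50)
kmean.scaled$size   # cluster sizes on standardized features
```

If the cluster assignments change substantially after scaling, the original solution was driven mostly by revenue.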
kmean.res <- kmeans(udata, 3)
kmean.res
K-means clustering with 3 clusters of sizes 63, 786, 534
Cluster means:
package medium small large Tran Revenue quantity
1 0.8412698 0.03174603 0.4444444 0.04761905 3.238095 141.04762 3.555556
2 0.3346056 0.23664122 0.4033079 0.03816794 1.262087 24.94911 1.298982
3 0.3389513 0.20411985 0.4531835 0.04119850 1.252809 24.02622 1.280899
time.first time.last
1 39.0000 32.22222
2 48.1056 47.43384
3 16.6030 15.29963
Clustering vector:
[1] 3 2 3 2 3 3 2 2 3 3 2 3 2 2 2 2 2 3 2 3 2 3 2 3 1 3 2 2 2 3 3 2 2 3 2 3 2
[38] 3 2 1 2 3 2 3 1 3 1 2 2 2 3 2 2 1 3 2 2 2 2 2 2 2 2 2 3 2 2 2 3 2 3 2 2 1
[75] 3 2 2 2 2 2 3 3 2 3 3 2 3 3 2 2 1 2 2 2 2 3 2 2 2 3 3 2 2 2 3 3 2 3 2 3 2
[112] 3 3 2 2 2 3 2 2 1 3 3 3 3 2 2 2 3 2 3 2 2 3 2 1 2 2 3 2 3 3 3 2 2 3 2 2 3
[149] 2 2 3 2 3 3 2 3 3 3 3 2 2 2 2 2 3 2 2 3 2 2 3 1 2 1 3 3 2 2 2 2 2 2 3 3 2
[186] 2 3 2 2 3 3 3 2 2 3 2 2 3 3 2 2 2 1 3 2 1 3 3 2 1 3 2 2 3 3 2 2 2 2 2 2 2
[223] 1 2 2 3 2 2 3 3 3 2 2 2 3 3 2 3 3 3 2 2 2 2 2 2 3 2 2 3 3 2 2 2 3 3 3 2 1
[260] 2 3 3 2 1 3 2 3 2 3 2 3 3 3 2 2 3 3 3 1 2 3 2 3 2 2 2 2 2 1 2 3 2 2 3 2 2
[297] 2 3 2 3 3 3 2 2 1 3 3 2 3 3 3 2 2 3 2 3 3 3 2 2 3 1 3 2 2 2 3 3 2 2 2 2 2
[334] 3 2 3 3 2 2 3 2 2 2 2 2 2 3 2 1 3 2 2 2 2 3 2 2 2 2 2 2 1 3 2 2 2 3 3 2 2
[371] 3 2 2 3 3 3 1 3 2 3 2 3 2 2 2 2 3 2 2 3 1 2 2 3 3 3 3 2 2 2 2 3 2 3 2 2 2
[408] 2 2 3 2 2 3 2 3 3 2 3 2 2 2 3 3 2 3 2 2 3 3 2 2 2 3 3 3 2 2 2 2 2 2 3 3 2
[445] 2 2 3 2 2 2 3 2 2 2 2 3 3 3 2 2 2 3 3 2 3 3 2 2 3 2 2 3 2 2 2 2 2 2 2 2 3
[482] 3 2 2 3 2 3 2 1 3 2 2 2 2 3 3 3 2 3 3 3 2 2 3 3 3 2 2 2 2 3 2 2 3 2 3 2 2
[519] 3 2 3 3 2 3 2 2 2 3 1 3 3 2 2 1 2 3 2 2 3 3 3 2 2 3 3 3 2 2 2 2 3 2 2 3 3
[556] 2 2 3 2 3 3 3 3 3 3 3 2 2 1 3 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 3 2 3 2 2
[593] 2 2 3 2 2 2 3 2 3 2 2 2 3 2 1 2 1 3 2 3 2 2 3 1 2 3 2 3 3 2 3 3 3 2 3 2 2
[630] 3 3 2 2 2 2 2 2 3 2 2 2 3 1 2 2 3 2 2 3 3 3 2 3 2 3 3 2 3 3 3 3 2 2 2 2 3
[667] 2 2 3 1 2 3 2 2 2 3 2 2 3 2 2 3 2 2 2 2 2 2 2 3 2 2 1 2 2 2 3 3 2 3 2 2 2
[704] 2 3 2 3 2 2 2 2 3 2 2 2 2 2 3 2 2 3 2 2 3 2 2 2 2 3 2 2 3 2 2 2 2 3 3 3 2
[741] 3 2 2 3 3 3 2 2 2 3 2 2 3 2 3 2 2 2 2 2 2 3 3 3 3 3 3 3 2 2 3 3 3 2 2 2 2
[778] 3 3 2 3 3 2 3 2 3 3 3 2 3 3 2 2 2 3 2 3 2 2 3 3 2 3 1 2 3 3 2 3 2 2 2 3 3
[815] 2 2 3 3 2 3 3 2 1 2 2 2 3 3 3 2 2 2 2 2 2 2 3 2 2 2 2 3 3 1 2 3 1 2 1 2 2
[852] 2 3 3 2 2 3 2 3 2 2 2 2 2 2 2 3 2 3 3 2 3 3 2 2 3 3 3 2 2 2 3 2 2 3 2 3 3
[889] 2 3 2 2 2 2 2 3 2 3 2 3 3 2 2 3 2 2 3 3 2 3 2 1 1 2 2 3 2 2 2 2 3 1 3 2 2
[926] 3 3 3 3 3 2 2 3 3 2 3 3 3 3 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2 2 2 3 2 2
[963] 3 3 3 3 2 2 3 2 2 2 3 2 2 2 3 2 3 1 3 3 3 3 2 3 2 3 2 3 2 2 2 3 3 3 2 2 1
[1000] 2 2 2 3 2 3 2 3 2 2 2 2 3 2 3 3 2 2 2 1 2 2 2 2 2 3 2 3 3 3 2 1 3 3 2 2 2
[1037] 2 2 2 3 2 2 3 2 2 2 1 3 2 2 1 2 3 2 3 2 3 3 2 2 2 2 2 3 2 2 2 2 2 2 1 3 2
[1074] 2 2 2 2 2 3 2 1 2 2 3 2 3 2 2 2 3 3 3 3 2 2 3 3 2 2 3 2 2 3 2 3 2 3 3 3 2
[1111] 3 3 2 3 3 3 3 2 3 2 3 3 2 2 2 2 3 3 2 2 2 2 2 3 2 2 3 3 3 1 3 3 2 2 1 2 2
[1148] 2 3 2 3 2 2 2 2 2 3 2 2 2 3 2 2 3 2 3 3 2 3 3 3 2 2 2 2 2 3 2 2 2 3 2 2 3
[1185] 2 1 1 3 3 2 3 3 3 2 2 2 2 3 3 3 2 3 2 3 2 2 2 2 3 3 2 2 2 3 3 2 1 3 2 2 2
[1222] 2 2 3 2 3 3 3 3 2 3 2 3 2 2 2 3 2 3 2 3 2 2 2 2 1 2 3 2 3 3 2 2 2 2 3 2 3
[1259] 2 3 3 3 3 2 3 3 3 2 2 1 3 3 3 2 2 2 2 2 3 2 1 2 3 2 3 2 3 3 2 3 3 1 3 2 2
[1296] 2 3 2 2 3 2 2 2 3 1 3 3 2 2 2 2 3 3 2 2 3 2 3 2 3 3 2 3 2 3 3 2 3 3 2 2 2
[1333] 2 2 3 2 2 2 2 2 3 2 2 2 3 3 2 3 3 2 2 2 1 2 2 2 3 2 3 2 3 3 2 3 2 2 2 2 2
[1370] 2 3 2 2 3 2 2 3 3 3 2 1 2 3
Within cluster sum of squares by cluster:
[1] 356193.5 346187.0 232988.1
(between_SS / total_SS = 61.0 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
kmean.res$cluster <- as.factor(kmean.res$cluster)
clusteredusers <- users
clusteredusers$cluster <- kmean.res$cluster
head(clusteredusers)
paste("Size of cluster", c(1,2,3), ":", summary(as.factor(clusteredusers$cluster)))
| Cross.device.user.ID | package | medium | small | large | Tran | Revenue | quantity | firstdate | lastdate | time.first | time.last | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 00146fb6a0c24b74232d | 1 | 0 | 0 | 0 | 1 | 15 | 1 | 2017-03-25 | 2017-03-25 | 22 | 22 | 3 |
| 003d60169becee4e390d | 0 | 1 | 0 | 0 | 1 | 17 | 1 | 2017-03-03 | 2017-03-03 | 44 | 44 | 2 |
| 0086c468fd6bc0d2ff61 | 0 | 0 | 1 | 0 | 1 | 10 | 1 | 2017-03-15 | 2017-03-15 | 32 | 32 | 3 |
| 00b1503d497a961b919e | 0 | 0 | 1 | 0 | 1 | 28 | 1 | 2017-03-08 | 2017-03-08 | 39 | 39 | 2 |
| 00de87f3528f3597af00 | 0 | 0 | 1 | 0 | 1 | 23 | 1 | 2017-03-22 | 2017-03-22 | 25 | 25 | 3 |
| 00eeb91a2c9ff36766a4 | 1 | 0 | 0 | 0 | 1 | 57 | 1 | 2017-04-09 | 2017-04-09 | 7 | 7 | 3 |
summary(clusteredusers[clusteredusers$cluster==1,])
summary(clusteredusers[clusteredusers$cluster==2,])
summary(clusteredusers[clusteredusers$cluster==3,])
Cross.device.user.ID package medium
039df8d15005b9008876: 1 Min. :0.0000 Min. :0.00000
0541abca3ea8a78ce2d8: 1 1st Qu.:0.0000 1st Qu.:0.00000
072a0c5c702131eccff9: 1 Median :0.0000 Median :0.00000
076e8674be4d2ad37a8d: 1 Mean :0.8413 Mean :0.03175
0914ddf25bea3630e32b: 1 3rd Qu.:2.0000 3rd Qu.:0.00000
0d2369ec05e610947ed1: 1 Max. :5.0000 Max. :2.00000
(Other) :57
small large Tran Revenue
Min. :0.0000 Min. :0.00000 Min. :1.000 Min. : 83
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:2.000 1st Qu.:101
Median :0.0000 Median :0.00000 Median :3.000 Median :115
Mean :0.4444 Mean :0.04762 Mean :3.238 Mean :141
3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:4.000 3rd Qu.:161
Max. :7.0000 Max. :3.00000 Max. :8.000 Max. :559
quantity firstdate lastdate time.first
Min. :1.000 Min. :2017-02-13 Min. :2017-02-14 Min. : 3.0
1st Qu.:2.000 1st Qu.:2017-02-25 1st Qu.:2017-02-27 1st Qu.:26.5
Median :3.000 Median :2017-02-28 Median :2017-03-16 Median :47.0
Mean :3.556 Mean :2017-03-08 Mean :2017-03-14 Mean :39.0
3rd Qu.:4.000 3rd Qu.:2017-03-20 3rd Qu.:2017-03-30 3rd Qu.:50.0
Max. :8.000 Max. :2017-04-13 Max. :2017-04-13 Max. :62.0
time.last cluster
Min. : 3.00 1:63
1st Qu.:16.50 2: 0
Median :31.00 3: 0
Mean :32.22
3rd Qu.:48.00
Max. :61.00
Cross.device.user.ID package medium
003d60169becee4e390d: 1 Min. :0.0000 Min. :0.0000
00b1503d497a961b919e: 1 1st Qu.:0.0000 1st Qu.:0.0000
011ba287dd801f3f98c0: 1 Median :0.0000 Median :0.0000
013c56f4157b7cd6d644: 1 Mean :0.3346 Mean :0.2366
017cbe8dcd707a1fe1dc: 1 3rd Qu.:1.0000 3rd Qu.:0.0000
01c32cefc97342474024: 1 Max. :4.0000 Max. :4.0000
(Other) :780
small large Tran Revenue
Min. :0.0000 Min. :0.00000 Min. :1.000 Min. : 0.00
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:13.00
Median :0.0000 Median :0.00000 Median :1.000 Median :19.00
Mean :0.4033 Mean :0.03817 Mean :1.262 Mean :24.95
3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.000 3rd Qu.:33.00
Max. :4.0000 Max. :2.00000 Max. :4.000 Max. :84.00
quantity firstdate lastdate time.first
Min. :1.000 Min. :2017-02-13 Min. :2017-02-13 Min. :32.00
1st Qu.:1.000 1st Qu.:2017-02-20 1st Qu.:2017-02-20 1st Qu.:42.00
Median :1.000 Median :2017-02-27 Median :2017-02-27 Median :48.00
Mean :1.299 Mean :2017-02-26 Mean :2017-02-27 Mean :48.11
3rd Qu.:1.000 3rd Qu.:2017-03-05 3rd Qu.:2017-03-06 3rd Qu.:55.00
Max. :7.000 Max. :2017-03-15 Max. :2017-04-04 Max. :62.00
time.last cluster
Min. :12.00 1: 0
1st Qu.:41.00 2:786
Median :48.00 3: 0
Mean :47.43
3rd Qu.:55.00
Max. :62.00
Cross.device.user.ID package medium
00146fb6a0c24b74232d: 1 Min. :0.000 Min. :0.0000
0086c468fd6bc0d2ff61: 1 1st Qu.:0.000 1st Qu.:0.0000
00de87f3528f3597af00: 1 Median :0.000 Median :0.0000
00eeb91a2c9ff36766a4: 1 Mean :0.339 Mean :0.2041
014a04bf9674f4ee0032: 1 3rd Qu.:1.000 3rd Qu.:0.0000
0171a55cdca297aca753: 1 Max. :2.000 Max. :3.0000
(Other) :528
small large Tran Revenue
Min. :0.0000 Min. :0.0000 Min. :1.000 Min. : 0.00
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:13.00
Median :0.0000 Median :0.0000 Median :1.000 Median :19.00
Mean :0.4532 Mean :0.0412 Mean :1.253 Mean :24.03
3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:30.00
Max. :3.0000 Max. :2.0000 Max. :6.000 Max. :88.00
quantity firstdate lastdate time.first
Min. :1.000 Min. :2017-02-15 Min. :2017-03-15 Min. : 0.00
1st Qu.:1.000 1st Qu.:2017-03-22 1st Qu.:2017-03-23 1st Qu.: 8.00
Median :1.000 Median :2017-03-31 Median :2017-04-02 Median :16.00
Mean :1.281 Mean :2017-03-30 Mean :2017-03-31 Mean :16.60
3rd Qu.:1.000 3rd Qu.:2017-04-08 3rd Qu.:2017-04-09 3rd Qu.:24.75
Max. :6.000 Max. :2017-04-16 Max. :2017-04-16 Max. :60.00
time.last cluster
Min. : 0.0 1: 0
1st Qu.: 7.0 2: 0
Median :14.0 3:534
Mean :15.3
3rd Qu.:24.0
Max. :32.0
We checked whether the clusters are dissimilar and whether customers within each cluster are similar.
Cluster 1 consists of customers who buy 3 products (median) with a median total expenditure of 115 Euros.
Cluster 2 consists of customers who buy 1 product (median) with a median expenditure of 19 Euros.
Cluster 3 likewise consists of customers who buy 1 product (median) with a median expenditure of around 19 Euros.
The difference between clusters 2 and 3 lies mainly in recency rather than spend: cluster 3 customers bought recently (on average about 15 days since their last purchase), whereas cluster 2 customers last bought about 47 days ago. In terms of product mix, cluster 3 buys slightly more small items and cluster 2 slightly more medium items.
Now we'll look at the frequency of new clients by month.
# Get monthly client counts and sales
newusersdata <- data[!data$Existing.client %in% 0,]
newusers <- newusersdata %>%
group_by(YearMonth) %>%
summarise(Existing.client = sum(Existing.client), Sales = sum(Dummy, na.rm=TRUE)) %>%
mutate(Salesratio = round(Sales/Existing.client, 2)) %>%
arrange(desc(Existing.client))
newusers
| YearMonth | Existing.client | Sales | Salesratio |
|---|---|---|---|
| 2016 Dec | 6810 | 830 | 0.12 |
| 2016 Nov | 5606 | 751 | 0.13 |
| 2017 Feb | 4444 | 554 | 0.12 |
| 2017 Jan | 3980 | 455 | 0.11 |
| 2016 Sep | 3947 | 391 | 0.10 |
| 2016 Apr | 3629 | 574 | 0.16 |
| 2016 Oct | 3506 | 408 | 0.12 |
| 2017 Mar | 3147 | 328 | 0.10 |
| 2016 Aug | 2592 | 199 | 0.08 |
| 2016 May | 2508 | 253 | 0.10 |
| 2016 Jun | 2269 | 193 | 0.09 |
| 2016 Jul | 2193 | 205 | 0.09 |
| 2017 Apr | 1913 | 172 | 0.09 |
| 2016 Mar | 280 | 35 | 0.12 |
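The `YearMonth` grouping variable comes from the non-public preprocessing; a minimal sketch of how it could be derived from the Unix `Timestamp` column with lubridate (the `"%Y %b"` labels assume an English locale):

```r
library(lubridate)

# Two epoch-second timestamps, as in the Timestamp column
ts <- c(1460462764, 1490572501)
dates <- as_datetime(ts)              # epoch seconds -> POSIXct (UTC)
YearMonth <- format(dates, "%Y %b")   # e.g. "2016 Apr"
YearMonth
```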
More can't be made public. Thank you.